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Abstract 

Testing and characterizing the difference between two data samples is of fundamen- 
tal interest in statistics. Existing methods such as Kolmogorov-Smirnov and Cramer- 
von-Mises tests do not scale well as the dimensionality increases and provide no easy 
way to characterize the difference should it exist. In this work, we propose a theoret- 
ical framework for inference that addresses these challenges in the form of a prior for 
Bayesian nonpar ametric analysis. The new prior is constructed based on a random- 
partition-and-assignment procedure similar to the one that defines the standard op- 
tional Polya tree distribution, but has the ability to generate multiple random distribu- 
tions jointly. These random probability distributions are allowed to "couple" , that is to 
have the same conditional distribution, on subsets of the sample space. We show that 
this "coupling optional Polya tree" prior provides a convenient and effective way for 
both the testing of two sample difference and the learning of the underlying structure 
of the difference. In addition, we discuss some practical issues in the computational 
implementation of this prior and provide several numerical examples to demonstrate 
its work. 
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1 Introduction 



Two sample comparison is a fundamental problem in statistics. With two samples of data at 
hand, one often wants to answer the question — "Did these two samples come from the same 
underlying distribution?" In other words, one is interseted in testing the null hypothesis 
that the two data samples were generated from the same distribution. Moreover, in the 
presence of evidence for deviation between the two samples, one often hopes to learn the 
structure of such difference in order to understand, for example, what factors could have 
played a role in causing the difference. Hence two sample comparison is interesting both as 
a hypothesis testing problem and as a data mining problem. In this work, we consider the 
problem from both aspects, and develop a Bayesian nonparametric approach that can serve 
both the testing and the learning purposes. 

Nonparametric hypothesis testing for two sample difference has a long history and rich 
literature, and many methods have been proposed. Some well-known examples include 
Wilcoxon test [UJ p. 243], Kolmogorov-Smirnov test |H pp. 392-394] and Cramer- von-Mises 
test pp. Recently, this problem has also been investigated from a Bayesian nonparametric 
perspective using a Polya tree prior [TO] . 

Despite the success of these existing testing methods for one- dimensional problems, two 
sample comparison in multi-dimensional spaces remains a challenging task. A basic idea 
for many existing methods is to estimate the two underlying distributions, and then use 
a distance metric to measure the dissimilarity between the two estimates. Tests such as 
Kolmogorov-Smirnov (K-S) and Cramer-von-Mises (CvM) fall into this category. However, 
reliably characterizing distributions in multi-dimensional problems, if computationally fea- 
sible at all, often requires a prohibitively large number of data points. With even just a 
moderate number of dimensions, the estimated distributional distance is often highly vari- 
able or biased. This "curse of dimensionality" demonstrates itself in the Bayesian setting as 
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well. This is true even when the underlying difference is structurally very simple and can be 
accounted for by a relatively small number of dimensions in the space. 

One general approach to dealing with the curse of dimensionality when characterizing dis- 
tributions in a multi-dimensional space is to learn from the data a partition of the space that 
best reflects the underlying structure of the distribution(s). A good partition of the space 
overcomes the sparsity of the data by placing true neighbors together, and it reduces compu- 
tational burden by allowing one to focus on the relevant blocks in the space. Hence it can be 
very helpful in multi-dimensional, and especially high-dimensional, settings to incorporate 
the learning of a representative partition of the space into the inference procedure. Wong 
and Ma [I7J adopted this idea and introduced the optional Polya tree (OPT) prior as such 
a method under the Bayesian nonparametric framework. Through optional stopping and 
randomized splitting of the sample space, a recursive partitioning procedure is incorporated 
into the parametrization of this prior, thereby allowing the data to suggest parsimonious 
divisions of the space. The OPT prior, like other existing Bayesian nonparametric priors, 
deals with only one data sample, but as we will demonstrate in this paper, similar ideas can 
be utilized for problems involving more than one sample as well. 

Besides the difficulty in handling multidimensional problems, existing nonparametric 
methods for two sample comparison are also unsatisfactory in that they provide no easy 
way to learn the underlying structure of the difference should it exist. Tests such as K-S 
and CvM provide statistics with which to test for the existence of a difference, but does 
not allow one to characterize the difference — for example what variables are involved in 
the difference and how. One has to resort to methods such as logistic regression that rely 
on strong modelling assumptions to investigate such structure. Similarly, Bayes factors 
computed using nonparametric priors such as Dirichlet process mixture and the Polya tree 
prior also shed no light on where the evidence for difference has arisen. 

In this work, we introduce a new prior called "coupling optional Polya tree" (co-OPT) 
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designed for Bayesian nonparametric inference on the two sample problem. This new prior 
jointly generates two random distributions through a random-partitioning-and-assignment 
procedure similar to the one that gives rise to the OPT prior [T7]. The co-OPT framework 
allows both hypothesis testing on the null hypothesis and posterior learning of the distri- 
butional difference in terms of a partition of the space that "best" reflects the difference 
structure. The ability to make posterior inference on a partition of the space also enhances 
the testing power for multi-dimensional problems. 

This paper is organized as follows. In Section 2 we review the construction of the OPT 
distribution. In Section 3 we generalize the definition of the OPT distribution by replacing 
the "uniform base measure" (defined later) with a general absolutely continuous distribution, 
and show that this generalized prior can be used for investigating the goodness-of-fit of the 
data to the base distribution. In Section 4 we introduce the co-OPT prior and show how 
Bayesian inference can be carried out using this prior. In addition, we discuss the practical 
issues in implementing inference using this prior. In Section 5 we provide several numerical 
examples to illustrate inference on the two sample comparison problem using this prior and 
compare it to other methods. Then in Section 6 we present a method for inferring two 
common distributional distances, L\ and Hellinger, between the two sample distributions 
using a co-OPT prior and provide two more numerical examples. Section 7 concludes with 
a few remarks. 

We close this introduction with a few words on the recent development in the Bayesian 
nonparametric literature on related topics. In the past 15 years, several methods have 
been proposed for testing the one sample goodness-of-fit, in particular, for non-parametric 
alternatives against a parametric null. For some examples see [7J El IU [3j [9l [131 [16]. As for 
two sample comparison, Holmes et. al. [10] introduced a way to compute the Bayes factor 
for testing the null through the marginal likelihood of the data with Polya tree priors. Under 
the null, they model the two samples to have come from a single random measure distributed 
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as a Polya tree, while under the alternative from two separate Polya tree distributions. In 
contrast, our new prior allows the two distributions to be generated jointly through one 
prior even when they are different. It is this joint generation that allows both the testing of 
the difference and the learning of the structure simultaneously. There are other approaches 
to joint modeling of multiple distributions in the Bayesian nonparametric literature. For 
example, one idea is to introduce dependence structure into Dirichlet processes [12] • For some 
notable examples see [HJ IS1 IE], among many others. Compared to these methods based on 
Dirichlet processes, our method, based on the optional Polya tree, allows the resolution of the 
inference to be adaptive to the data structure and handles the sparsity in multidimensional 
settings using random partitioning [T7]. Moreover, our method allows direct inference on 
the distributional difference without relying on inferring the two distributions per se, making 
it particularly suited for comparison across multiple samples. This point will be further 
discussed in Section 4.2 and illustrated in the examples given in Sections 5 and 6. 

2 Optional Polya trees and Bayesian inference 

Wong and Ma [17] introduced the optional Polya tree (OPT) distribution as an extension to 
the Polya tree prior that allows optional stopping and randomized partitioning of the sample 
space Q, where Q is either finite or a rectangle in an Euclidean space. One can think of this 
prior as a procedure for generating random probability measures on Q that consists of two 
components — (1) random partitioning of the space and (2) random probability assignment 
into the parts of the space produced by the partitioning. 

We first review how the OPT prior randomly partitions the space. Let TZ denote a 
partition rule function which, for any subset A of Q, defines a number of ways to partition 
A into a finite number of smaller sets. For example, for Q = [0, 1] x [0, 1], the (coordinate- 
wise) diadic split rule TZ is that TZ(A)={ splitting A in the middle of the range of one of 
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the coordinates Xj, j = 1 or 2} if A is a non-empty rectangle and = otherwise. We call 
a rule function 7Z finite if V7L C VL, the number of possible ways to partition A, M(A), as 
specified by 7Z, is finite. In the rest of the paper, we will only consider finite partition rules. 
Let K j (A) be the number of children specified by the jth way to partition A under 71(A), 
and let A\ denote the ith child set of A in that way of partitioning. That is, A = ' A{ 

for j = 1, 2, . . . , M(A). We can write 71(A) as 

71(A) = {{A\, A\,... A^}, {Al A 2 2 ,...,A 2 K2 },... {<, <, . . . A%„}} = {{^}fj 1 } ^ , 

where for simplicity we suppressed notation by writing M for M(A) and K for K(A). 

A partition rule function 7Z does not specify any particular partition on Q but rather 
a collection of possible partitions over which one can draw random samples. The OPT 
prior samples from this collection of partitions in the following sequential way. Starting 
from the whole space A = Vl. If M(A) = 0, then A is not divisible under 7Z and we 
call A an atom (set). In this case the partitioning of A is completed. If M(A) > 0, that 
is, A is divisible, then a Bernoulli(p(A)) random variable S(A) is drawn. If S(A) = 1, 
we stop partitioning A. Hence S(A) is called the stopping variable for A, and p(A) the 
stopping probability. If S(A) = 0, A is divided in the J(A)ih of the M(A) available ways for 
partitioning A under 71(A), where J(A) is a random variable taking values 1,2, ... , M(A) 
with probabilities Xi(A), A 2 (A), . . . , Xm(A)(A) respectively, and J^fjf'' Xj(A) = 1. J(A) is 
hence called the (partition) selector variable, and X(A) = (X±(A), X 2 (A), . . . , Xm(a)(A)) the 
(partition) selector probabilities. If J(A) = j, we partition A into {A{, A\, . . . A J Kj , A A, and 
then apply the same procedure to each of the children. In addition, if A is reached from Q 
after k steps (or levels) of recursive partitioning, then we say that A\ is reached after k + 1 
steps (or levels) of recursive partitioning. (To complete this inductive definition, we say that 
the space fl is reached after steps of recursive partitioning.) The recursive partitioning 
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procedure naturally gives rise to a tree structure on the sample space. For this reason, we 
shall also refer to the sets A that arise during the precedure as (tree) nodes. 

The first question that naturally arises is whether this sequential procedure will eventually 
"stop" and produce a well defined partition on Q. Given that the stopping probability 
p(A) > 5 for some 5 and all A, this is indeed true in the following sense. If we let p be 
the natural measure on Q — the Lebesgue measure if Q is a rectangle in an Euclidean space 
or the counting measure if Q is finite, then p{T±) — > with probability 1, where Tf is the 
part of Q that is still not stopped after k steps of recursive partitioning. In other words, the 
partitioning procedure will stop almost everywhere on Q. 

The second component of the OPT prior is random probability assignment. The prior 
assigns probability mass into the randomly generated parts of the space in the following 
manner. Starting from A = Q, assign Q(A) = 1 total probability to A. If A is stopped or is 
an atom, then let the conditional distribution within A be uniform. That is, Q(-|^4) = 
where u denotes the uniform density (w.r.t. p) and this completes the probability assignment 
on A. If instead A has children {A{, A 2 , . . . A J KJ ^}, (this occurs when S(A) = and J{A) = 
j,) a random vector (8{(A), 2 (A), . . . ,8 J KJ ^(A)) on the K^(A) — 1 dimensional simplex is 
drawn from a Dirichlet(a{(A), a^A), . . . , a J Kj ^(A)) distribution, and we assign to each child 
A\ probability mass Q(A\) = Q(A)6l(A). We call 6 j (A) = {6{{A), 6 J 2 {A), 6 3 KJ(A) (A)) the 
(probability) assignment vector, and a? {A) = (a{(A), a 2 (A), . . . , a 3 K3 ^(A)) the pseudo- 
count parameters. Then we go to the next level and assign probability mass within each of 
the children in the same way. 

Theorem 1 in [T7] shows that if p{A) > 5 for some 5 > and all A, then with probability 
1 this random partitioning and assignment procedure will give rise to a probability measure 
Q on Q that is absolutely continuous with respect to fi. This random measure Q is said to 
have an OPT distribution with (partition rule 1Z and) parameters p, A and a, which can 
be written as 0PT(1Z; p, A, a). In addition, Wong and Ma p2] also show that under mild 
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conditions, this prior has large support — any L x neighborhood of an absolutely continuous 
distribution (w.r.t. //) on Q has positive prior probability. 

Two key features of the prior are demonstrated from the above constructive description. 
The first is self-similarity. If a set A is reached as a node during the recursive partitioning 
procedure, then the continuing partitioning and assignment within A, which specifies the 
conditional distribution on A, is just an OPT procedure with Q = A. The second feature is 
the prior's implicit hierarchical structure. To see this, we note that the random distribution 
that arises from such a prior is completely determined by the partition and assignment 
variables S, J, and 0, while the prior parameters p, A and a. specify the distributions of 
these "middle" variables. 

These two features allow one to write down a recursive formula for the likelihood under 
a random distribution arising from such a prior. To see this, first let Q (with density q) be a 
distribution arising from an OPT(1Z; p, A, a) distribution, and for Acfl, let q(-\A) be the 
conditional density on A. Let S, J, and be the corresponding partition and assignment 
variables for Q (or q). Suppose one has n i.i.d. observations, xi, x 2 , x n , on Q from 
q(-\Q). Define 

x(A) = {x!,x 2 , • • • ,x n } n A, 

the observations falling in A, and n(A) = \x(A)\, the number of observations in A. Then 
for any node A reached in the recursive partitioning process determined by the S and J 
variables, the likelihood of observing x(A) conditional on A is 

q(x(A)\A) = Su(x(A)\A) + (1-S) |n (0/) n(A/) j (f[q(x (Aj) \A^ , (2.1) 

where u(x(A)\A) = ^ A * n(A) is the likelihood under the uniform distribution on A, S = S(A), 
J = J (A), K J = K J ^(A), and Qf = 9- {A) (A). (Note that for this formula to hold we need 
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to define q($\A) := 1.) From now on we will always suppress the "(A)" notation for the 
random variables and the parameters where this adds no confusion. Similarly, we will use 
q(x\A) and u(x\A) to mean q(x(A)\A) and u(x(A)\A), respectively. 

Integrating out S, J, and 6 in (12. ip . we get the corresponding recursive representation 
of the marginal likelihood 



P{x\A) = pu{x\A) + (l-p) y £ A; U P M) » ( 2 -2) 

j=l { > i=l 

where P(x\A) = P(x(A)\A), = n j {A) = (n(A{), n(A{), . . . , n(A j K3{A) )), and D(t) = 
r(ti) . . . T(tk)/T(ti + • • • + tk). Wong and Ma p2] provide terminal conditions so that ( 12. 2ft 
can be used to compute the marginal likelihood conditional on A, P(x\A), for all potential 
tree nodes A determined by 1Z. 

The final result we review in the section is the conjugacy of the OPT prior. More 
specifically, given the i.i.d. observations x, the posterior distribution of Q is again an OPT 
distribution with 

1. Stopping probability: p(A\x) = p(A)u(x\A) / P(x\A) 

2. Selection probabilities: 

Xj(A\x) oc XM) [ D( 2 3) II P ( x \^) for j = 1, . . . , M(A) 

\ ' i=l 

3. Probability assignment pseudo-counts: a{(A\x) = a{(A) + n(A\) 
for j = 1, . . . , M(A) and i = 1, 2, ... , K j (A) 

where again A is any potential node determined by the partition rule function 1Z on Q. 
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3 Optional Polya trees and 1-sample "goodness-of-fit" 

In the constructive procedure for an OPT distribution described above, whenever a node 
A is stopped, the conditional distribution within it is generated from that of a baseline 
distribution, namely the uniform u(-\A). For this reason, we say that the collection of 
conditional uniform distributions, : A is a potential node under 71} , are the local 

base measures. With uniform local base measures, the stopping probability p for a region A 
represents the probability that the distribution is "flat" within A. Accordingly, the posterior 
OPT concentrates probability mass around partitions that best captures the "non-flatness" 
in the density of the data distribution. Such a partitioning criterion is most natural in the 
context of density estimation. 

One can extend the original OPT construction by adopting different local base measures 
or stopping criteria for the nodes. More specifically, we can replace with any absolutely 

continuous measure m A {-) on node A in the probability assignment step. That is, when a tree 
node A is stopped, we let the conditional distribution in A be m A (-). With this generalization, 
the recursive constructive procedure for the OPT distribution and the recipe for Bayesian 
inference described in the previous section still follow through. 

One choice of the m A measures is of particular interest. Specifically, we can let m A {-) = 
qo(-\A) for some absolutely continuous distribution Q with density go on Q. For this special 
case, we have the following definition. 

Definition 1. The random probability measure Q that arises from the random-partitioning- 
and- assignment (RPAA) procedure described in the previous section, with u replaced by 
qo, the density (w.r.t. /i) of an absolutely continuous distribution Q Q , is said to have an 
optional Polya tree distribution on 1Z with parameters A, ex, p, and (global) base measure 
(or distribution) Q . We denote this distribution by OPT(1Z\ A, a, p\ Q ). 

The next theorem shows that by choosing an appropriate partitioning rule 1Z and/or 
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suitable pseudocount parameters a, one can enforce the random distribution Q to "center 
around" the base measure Q . 

Theorem 1. If Q ~ 0PT(1Z; p, A, (x; Q ), where 5 < p(A) for some 5 and all potential tree 
nodes A, then V Borel set B, 

EQ(B)=Q (B), 

provided that for all A, j = 1,2, ... , M(A) and i = 1,2, ... , K^{A), we have 

KHA) 

°4(A)/ £ <^ = Qo(M)/Qo(A). 
h=i 

Proof. See supplementary materials. □ 

Remark: If we have equal pseudocounts, that is, a{(A) = a J 2 (A) = ■ ■ ■ = a J KJ ^(A) for all 
potential nodes A and all j, then the condition for the theorem becomes Qq{A\)/Qq{A) = 
l/Ki{A). Therefore one can choose a partition rule TZ on fl based on the base measure to 
center the prior. 

Bayesian inference using the OPT prior with a general base measure can be carried out 
just as before, provided we replace u(x\A) replaced by ^(^l^) everywhere. An important 
fact is that a random distribution with this prior has positive probability to be exactly the 
same as the base distribution. Therefore, one can think of the inferential procedure for the 
OPT prior as a sequence of recursive comparison steps to the base measure. More specifi- 
cally, the partitioning decision on each node A is determined by comparing the conditional 
likelihood of the data within A under Qo to the composite of M(A) composite alternatives. 
The partition of each node A stops when the observations in A "fits" the structure of the 
base measure, and the posterior values of the partitioning variables capture the discrepancy, 
if any, between the data and the base. Consequently, this framework can be used to recur- 
sively test for 1-sample goodness- of-fit and to learn the structure of any potential "misfit". 
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For each node A, the posterior stopping probability is the probability that the data distri- 
bution coincides with the base distribution conditional on A. In particular, the posterior 
stopping probability for the whole space Q, p(£i), measures how well the observed data fit 
the base overall. The posterior values of the other partitioning and pseudocount variables 
reflect where and how the data distribution differs from the base. 

4 Coupling optional Polya trees and two sample comparison 

In this section we consider the case when two i.i.d. samples are observed and one is interested 
in testing and characterizing the potential difference between the underlying distributions. 
From now on, we let Q\ and Q2, with densities q\ and q 2 , be the two distributions from 
which the two samples have come from. 

4.1 Coupling optional Polya trees 

A conceptually simple way to compare Qi and Q 2 is to proceed in two steps — first estimate 
the two distributions separately, and then use some distance metric to quantify the difference. 
For example, one can place an OPT prior on each of Q\ and Q2 and use the posteriors to 
estimate the densities [IZj . (Other density estimators can also be used for this purpose.) 
With the density estimates available, one can then compute standard distance metrics such 
as L\, and in turn use this as a statistic for testing the difference. (This approach provides 
no easy way to characterize how the two distributions are different.) 

However, this two-step method is undesirable in multidimensional, and especially high- 
dimensional, settings. The main reason is that reliably estimating multidimensional distri- 
butions is a very difficult problem, and in fact often a much harder problem than comparing 
distributions. This difficulty in turn translates into either high variability or large bias in the 
distance estimates, and thus low statistical power. Using this approach, one is essentially 
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making inference on the distributional difference indirectly, through the inference on a large 
number of parameters that characterize the two distributions per se but have little to do 
with their difference. 

Following this reasoning, it is favorable to make direct inference on "parameters" that 
capture the distributional difference. But such direct inference requires (from a Bayesian 
perspective) that the two distributions be generated from a joint prior. This prior should be 
so designed that in the corresponding joint posterior, information regarding the distributional 
difference can be extracted directly. We next introduce such a prior. 

Our proposed method for generating the two distributions Q\ and Q 2 is again based on 
a procedure that randomly partitions the space Q and assigns probability masses into the 
parts, similar to the one that defines the OPT prior. What differs from the procedure for 
the OPT is that we add in an extra random component — the conditional coupling of the 
two measures Qi and Q 2 within the tree nodes. We next explain this construction in detail. 
Starting from the whole space A = Q, we draw a random variable 

C(A) ~ Bernoulli^ (A)), 

which we call the coupling variable. If C(A) = 1, then we force Qi and Q 2 to be coupled 
conditional on A — that is, Qi(-\A) = Q 2 (-\A) — and we achieve this by generating a common 
conditional distribution from a Stanford OPT on A. That is 

Q 1 (-\A)=Q 2 (-\A)~OPT\ A (K;p, A 6 , a b ), 

where the "b" superscript stands for "base", and the U \A" notation should be understood 
as the restriction to A of the partition rule 7Z, the stopping variables p, the partition selec- 
tor variables A 6 , and the assignment pseudo-count variables ac b . (For A = Q, there is no 
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restriction.) If C(A) = 0, then we draw a partition selector variable 

J (A) 6 {1, 2, . . . , n] with P(J(A) = j) = X j (A). 

If J (A) = j, then we partition A under the j'th way according to 1Z(A). Then draw two 
independent assignment vectors 

e{{A) = (9HA), 9{ 2 (A), e{ Kj{A) (A)) ~ Wld^ti), a{ 2 (A), a{ Kj{A) (A)) 
Oi(A) = (^(A), 4 2 (A), . . .,el KHA) (A)) ~ Dirichlet{o? 21 {A), a 22 {A), . . . , «4 J(A) (A)), 

and let 

Q 1 (i4j) = g 1 (A)^ i (i4) and Q 2 (A|) = Q a (i4)^(A) 

for each child of A. We call and #2 (.4) the assignment vectors for Q\ and Q2 (in 

the uncoupled state). Then we go down one level and repeat the entire procedure for each 
A\ , starting from the drawing of the coupling variable. 

Again, the first natural question to ask is whether this procedure will actually stop and 
give rise to two random probability measures {QiiQ-z}- The answer is positive under very 
mild conditions, and this is formalized in Theorem [2j The statement of the theorem uses 
the notion of "forced coupling", which is similar to the idea of "forced stopping" used in 
the proof of Theorem [1] and which we describe next. Let (Qi ,Q<p) denote the pair of 
random distributions arising from the above random-partitioning-and-assignment procedure 
with forced coupling after /c-levels of recursive partitioning. That is, if after k levels of 
partitioning a node A is reached and the two measures are not coupled on it, then force 
them to couple on A and generate Q[ k) (-\A) = Q 2 \-\A) from OPT\ A (1Z; p, X b , a b ). We do 
this for all such nodes to get (Q[ k \ Q^). 

Theorem 2. In the random-partitioning- and- assignment procedure for generating a pair of 
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measures described above, if r y(A),p(A) > S for some 5 > and all potential nodes A defined 
by the partition rule 1Z, then with probability 1, (Qi,Q>2^) converges to a pair of absolutely 
continuous (w.r.t. p) random probability measures {Qi,Q2) in the following sense. 



sup EeB \Q^(E) - Q X (E)\ + \QF(E) - Q 2 (E)\ 0, 



where B is the collection of Borel sets. 

Definition 2. This pair of random probability measures (Qi, Q 2 ) is said to have a coupling 
optional Polya tree (co-OPT) distribution with partition rule TZ, coupling parameters A, oci, 
a 2 , 7, and base parameters A 6 , a b , p, and can be written as co-OPT(lZ] A, oci, 0:2, 7; A b , et b , p). 



Similar to the OPT prior, the co-OPT distribution has large support under the L± metric. 
This is formulated in the following theorem. 

Theorem 3. Let Q be a bounded rectangle in MP. Suppose that the condition of Theorem^ 
holds along with the following two conditions: 

(1) For any e > 0, there exists a partition of the sample space allowed under 1Z, Q = \J* =1 Ai, 
such that the diameter of each node A4 is less then e. 

(2) The coupling probabilities ^(A), stopping probabilities p(A), coupling selector probabil- 
ities \i{A), base selection probabilities X b j(A), as well as the assignment probabilities 



<(A)/(Ei4i( A ))> 4(^)/(E*4(^)), and o!?{A)/ oc^ j (A)) for all i, j and all 



potential elementary regions are uniformly bounded away from and 1 . 

Let qi = dQi/dp and q 2 = dQ 2 /dp, then for any two density functions f\ and f 2 , and any 
t > 0, we have 



Proof of Theorem [3 See supplementary materials. 



□ 



P 




qi(x) — fi(x)\dp < t and / \q 2 (x) — f 2 (x)\dp < r > 0. 
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Proof. See supplementary materials. 



□ 



4.2 Bayesian inference on the two sample problem using co-OPT prior 

We next show that the co-OPT prior is conjugate and introduce the recipe for making in- 
ference on the two sample problem using this prior. Suppose (Qi,Q 2 ) is distributed as 
co-OPT(1Z; A, ai, ct 2 , 7; \ b , ot b , p), and we observe two i.i.d. samples X\ = (xn, xi 2 , ■ ■ ■ > %im) 
and x 2 = (x 2 i,x 22 , . . . ,x 2n2 ) from Qi and Q 2 respectively. For a node A reached dur- 
ing the random partitioning steps in the generative procedure of (Qi,Q2), let Xx(A) = 
{xn, Xu, ■ ■ ■ , Xi ni } H A and x 2 (A) = {x 2 i,x 22 , . . . , x 2n2 } fl A be the observations from the 
two samples in A, and let rii(A) = \xi(A)\ and n 2 (A) = \x 2 (A)\ be the sample sizes in A. 
As before, we let qi and q 2 denote the densities of the two distributions and let q$ denote 
the density of the random local base measure Qq . 

The likelihood of Xi(A) on A under and that for x 2 (A) under q 2 (-\A) are 

' qi ( Xl \A) = cq£{ Xl ) + (1 - c) nfJM) MAi) Qi(*Mf) 

(4.1) 

k q 2 (x 2 \A) = Cq${x 2 ) + (1 - (J) ]\tMd MAi) <l2{x 2 \Ai) 

where we have again suppressed the "(A)" notation for C(A), J (A), K(A) J{ - A \ e J ^ A \A), 
9 2 i(A), Xi(A) and x 2 (A). The joint likelihood of observing Xi(A) and x 2 (A) conditional 
on A is 

K J 

qi ( Xl \A) q2 (x 2 \A) = Cq^x u x 2 ) + (1 - C) n(^) ni(AJ) (^)" 2(Af) gi(^il4 J )g 2 (^l4 J ), 

(4.2) 

where qo(xi, x 2 ) = q^x^qQ^x^ is the standard OPT likelihood for the combined sample 
x(A) = (cCi(A), x 2 (A)) on A given by fl2TT|) . Integrating out q£, C, J, 6{ and Q J 2 from fl4~2j) . 
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we get the conditional marginal likelihood 

P( Xl , x 2 \A) = jPo(*i, x,\A) + (1-7) E ^^t^^^ fl P(*> **\M)> 

(4.3) 

where n{ = (n h (A{),n h (A J 2 ), . . . ,n h {A 3 Kj ) and a{ = {a? hl {A) , a J h2 (A) , . . . } a 3 hK3 (A)) for = 
1,2, and Po(*i; #2 1^4) is the conditional marginal likelihood of the combined sample under 
a standard OPT as given by (12. 2p . Equation (14. 3 \ provides a recursive recipe for computing 
the marginal likelihood term P(x\, x 2 \A) for each potential tree node A. (Of course, for this 
recipe to be of use, one must also specify the terminal conditions for the recursion. We will 
discuss ways to specify such conditions in the next subsection.) 

From ( 14. 3 p one can tell that the posterior distribution of (Qi, Q 2 ) is still a co-OPT distri- 
bution through the following reasoning. The first term on the RHS of (14. 3p . , -fPo(xi, x 2 \A), 
is the probability (conditional on A being a node reached in the partitioning) of the event 

{Qi and Q 2 get coupled on A, observe X\{A) and x 2 (A)}. 

The second term, (1 — 7) Yl*f=i ~^~7TT~^ ^^n^ Tli=i P( x ii X 2\A), is the probability of 

J D(a 1 )D(a 2 ) 

{Qi and Q 2 are not coupled on A, observe X\(A) and x 2 (A)}. 
Each summand, (1 - 7) A, ^l^l^ USi P( 

x x ,x 2 \A), is the probability of 

{Qi and Q 2 not coupled on A, divide A in the jth way, observe X\(A) and x 2 (A)}. 

Finally, given that C(A) = and J (A) = j, the posterior distribution for 6\ and 6 2 are 
Dirichlet(n{ + a{) and Dirichlet(n2 + ac 2 ), respectively. This reasoning, together with The- 
orem 3 in [17] . shows that the co-OPT prior is conjugate, and simple applications of Bayes' 
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Theorem provide the formulae of the parameter values for the posterior. The results are 
summarized in the next theorem. 

Theorem 4. Suppose X\ = (xn, X\ 2 , . . . , Xi ni ) and x 2 = (X21, X22, • • • , xi n2 ) are two indepen- 
dent i.i.d. samples from Q 1 and Q^- Let (Qi,Q 2 ) have a co-OPT(1Z; A, 0:1,0:2,7; A 6 , o b , p) 
prior that satisfies the conditions in Theorem^ Then the posterior distribution of (Qi,Q 2 ) 
is still a coupling optional Polya tree with the following parameters. 

1. Coupling probabilities: 7 (A \ x 1, x 2 ) = , y(A)P (xi, X2\A)/P(xi, x 2 \A). 

2. Partition selection probabilities: 

\ s {A\x u X2) oc ^M) D{n{ t a ? D n {n ^^ fl P ^ *'\ A ), 3 = ^-, M(A). 

D{cx{)D{ol j 2 ) 

3. Probability assignment pseudo-counts: 

a{ i (A\xi,x 2 ) = ni(Ai) + al^A) and a 3 2i (A\x 1 ,x 2 ) = n 2 {Ai) + a J 2i (A), 

for j = 1,2,..., M(A) andi = 1, 2, . . . , K j (A). 
4- Base stopping probabilities: p(A\xi,x 2 ) = p(A)u(xi, X2\A)/P (xi, x 2 \A). 

5. Base selection probabilities: 

\>(A\ Xl , x 2 ) cx X b j(A) 1 ^ 2 + 1 J] P (x h x 2 \A), j = 1, 2, . . . , M(A). 

^ > i=l 

6. Base assignment pseudo-counts: a.j(A\x\, x 2 ) = n\{A{) + n2{A\) + a/ (A), 
for j = 1,2,..., M(A) andi = 1, 2, . . . , K j (A). 

Two remarks: (1) All of the posterior parameter values can be computed exactly using 
the above formulae, without the need of any Monte Carlo procedure. (2) The posterior 
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coupling parameters contain information about the difference between the two underlying 
distributions Qi and Q 2 , while the posterior base parameters contain information regarding 
the underlying structure of the two measures. This naturally suggests that if one is only 
interested in two sample comparison, one should only need the posterior distribution of the 
coupling variables, and not those of the base variables. This will become clear in Sections 5 
and 6 where we give several numerical examples. 

4.3 Terminal conditions 

As mentioned earlier, we need to specify the terminal conditions for the recursion used to 
compute P(x\, X2\A). Depending on the nature of Q and the prior specification, the recursion 
formula (14.31) can terminate in several ways as demonstrated in the following two examples. 

Example 1 (2 P contingency table). Let = {1, 2} x {1, 2} x • • • x {1, 2}. For any rectangle 
A in the table — a set of the form A\ x A2 x . . . A p with A±, A 2 , ■ ■ ■ , A p being non-empty 
subsets of {1, 2} — let k±, k 2 , ■ ■ ■ , &m(A) be the "intact" dimensions of A, that is A k] = {1,2} 
for j = 1,2,..., M(A). Let TZ be the diadic splitting rule that allows A to be divided into 
two halves on each intact dimension j. In our earlier notation, K(A) = {{A{, AfifJ^} , 
where A{ = A x x A 2 x • • • x A k ._i x {1} x A kj+ i x • • • x A p and A\ = A\ x A 2 x • • • x x 
{2} x A k . + i x • • • x A. p . Suppose two i.i.d. samples x% and x 2 are observed. Assume that 
{Qii Q2) has a co-OPT prior with the following prior parameter values for each rectangle A: 
Xj(A) = \){A) = j^-, aj(A) = a\ j (A) = \ for i = 1, 2 and j = 1, 2, . . . , M(A), and finally 
■y(A) = 70, p(A) = p , where 70 and po are constants in (0, 1). 

In this example, there are three types of terminal nodes for P (xi,x 2 \A) and they are 
given in Example 3 of [17J. By a similar reasoning, there are also three types of terminal 
nodes for P(xi, x 2 \A). 

1. If A contains no data point from either sample, P(xi,x 2 \A) = 1. 
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2. If A is a single table cell containing any number observations, P{x\, x 2 \A) = 1. 

3. A contains a single observation (from either sample). In this case, P(xi, x 2 \A) = 
2 -m(A) = i/n(A). To see this, first we let t M{A) = P(x h x 2 \A). By Example 3 in [T7], 
we have Pq(x\, ^2^) = 2~ M<yA \ Hence we have 

t M{A) = 102~ M{A) + (1 - 70) g fjfjj J " *M(A)-1 

= 7o2- M(A) + (1-7o)^m ( a)-i 

7u2 -^ (l -(l-7o)^) , A-7oV^ 
^ 70 l-(l-To) V 2 J 

= 2 -M(A) = l/^A). 

Example 2 (Rectangle in W). Let Vt = I\ x J 2 x • • • x Ip be a bounded rectangle in R p . Let 
7?. be the diadic partition rule such that for any rectangle A of the form A\ x A 2 x . . . A p 
with Ai, v4 2 , . . . , A p being non-empty subintervals of I±, I 2 , . . . , I p respectively, A can be 
divided in half in any of the p dimensions. Again, let x\ and x 2 be the two samples, 
and let {Qi,Q 2 ) have a co-OPT prior with the following parameters: Xj{A) = A^(A) = -, 
a{(A) = a* j (A) = |, 7(A) = 70 and p(A) = p , for all A,i = l, 2, and j = 1, 2, . . . , M(A). 
In this case there are two types of terminal nodes for P(xi, x 2 \A). 

1. A contains no observations. In this case, P(xi,x 2 \A) = 1. 

2. A contains a single observation (from either sample). Then P(xi,x 2 \A) = l/p(A). 
We skip the derivation of this as it is similar to that used for Case 3 in Example [TJ 

Note that in this example we have implicitly assumed that no observations, from either sam- 
ple, can be identical. With the assumption that Qi and Q 2 are absolutely continuous w.r.t. 
the Lebesgue measure, the probability for any observations to be identical is 0. However this 
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situation can occur in real data due to rounding. This possibility can be dealt with in our 
following discussion on technical termination of the recursion. 

Other than the "theoretical" terminal nodes given in the previous two examples, in real 
applications it is often desirable to set a technical lower limit on the size of the nodes to be 
computed in order to save computation. For instance, in the MP example, one can impose 
that all nodes smaller than 1/1000 of the space Q be stopped and coupled. That is to let 
'j(A) = p(A) = 1 by design for all small enough A. The appropriate cutoff threshold of 
the node size depends on the nature of the data, but typically there is a wide range of 
values that work well. For most problems such a technical constraint should hardly have any 
impact on the posterior parameter values for large nodes. It is worth emphasizing that for 
real- valued data, which are almost always discretized (due to rounding), such a constraint 
actually becomes useful also in preventing numerical anomalies. In such general rule 

of thumb is that one should always adopt a cutoff size larger than the rounding unit relative 
to the length of the corresponding boundary of the space. 

5 Numerical examples on two sample comparison 

We next provide three numerical examples, Examples |3] through to demonstrate inference 
on the two sample problem using the co-OPT prior. In these examples, the posterior coupling 
probability of Q serves as a statistic for testing whether the two samples have come from 
the same distribution, which we will refer to as the co-OPT statistic. In each example 
we compare our method to one or more other existing approaches, and in Example [5] we 
show how the posterior values of the coupling variables can be used to learn the underlying 
structure of the discrepancy between the two samples. 

For all these examples, we set the prior parameter values in the fashion of Examples [1] 
and 121 with 7o = Po = 0.5. In Examples [3] and SJ whenever the underlying distributions have 
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unbounded support, we simply use the range of the data points in each dimension to define 
the rectangle fl (As a referee pointed out, an alternative to using this data dependent 
support is to transform each unbounded dimension through a measurable map such as a 
cumulative distribution function. The choice of such maps will influence the underlying 
inference. Although we do not investigate this relation in the current work, it is certainly 
interesting and deserve further studies in future works.) Also, in these three examples we use 
1/1000 as the size cutoff for "technical" terminal nodes as discussed in the previous section. 

Example 3 (Two sample problem in K). We simulate the control and case samples under 
the following three scenarios. 

1. Locational shift: Sample 1 ~ Beta(4,6) and Sample 2 ~ 0.2 + Beta(4,6) with sample 
sizes rii = n 2 = 20. 

2. Local structure: Sample 1 ~ Uniform[0,l] and Sample 2 ~ 0.5 Beta(20,10) + 0.5 
Beta(10,20) with n x = n 2 = 30. 

3. Dispersion difference: Sample 1 ~ N(0,1) and Sample 2 ~ N(0,4) with n\ = ri2 = 40. 

We place a co-OPT prior on {QiiQ-z) as described in Example [2j (Because here there is 
only one dimension, there is no choice of ways to split.) We compare the ROC curves of 
four different statistics for testing the null hypothesis that the two samples have come from 
the same distribution — namely the Kolmogorov-Smirnov (K-S) statistic P, pp. 392-394], 
Cramer- von- Mises (CvM) statistic pQ, Cramer-test statistic [2], and our co-OPT statistic. 



The results are presented in the middle column of Figure 1 In addition, we also investigate 



the power of each statistic at the 5% level under various sample sizes, ranging from 10 



data points per sample to 60 per sample. (See the right column of Figure 1 ) Our co-OPT 
statistic behaves worse than the other three tests under the first scenario when there is a 
simple locational shift, better than the other tests for the second scenario, slightly worse 
than the Cramer test but better than the K-S and CvM tests under the last scenario. 
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Figure 1: Two simulated samples on M under three scenarios (rows) given in Example [3j 
Left panel: Density functions for the two samples. Middle panel: ROC curves for four test 
statistics. Right panel: Power vs. sample size — power (at the 5% level) is estimated from 
simulation under equal sample size (horizontal axis) of the case group and the control group. 



Example 4 (Two sample problem in ]R 2 ). We simulate two samples under four scenarios. 
1. Locational shift (uq = n\ = 50): 

Sample 1 ~ BN I 



^2 2 0^ 



V / 



and Sample 2 ~ BN 



v° 22 J 



o 

w 



2 2 



v° 2 V 



2. Subset shift (n = rii = 100): 
Sample 1 ~ BN 







v / 



0.3 2 
0.3 2 



and 
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Sample 2~0.8xM| 
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v / 



0.3 2 
0.3 2 



\ 



+ 0.2 x BN\ 







'0.3 2 




5 


I 0.3 2 ^ 



3. Dispersion difference (n = n\ = 50): 



Sample 1 ~ BN \ 



v / 



v° V 



and Sample 2 ~ BN\ 



f \ / 



v / 



0.5 2 
0.5 2 



4. Local structure (no = n\ = 50): 

(° 



Sample 1 ~ BN 



Sample 2 ~ 0.5x5iV 
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\-°- 5 / 
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0.4 2 

We compare four statistics that measure the similarity between two distributions — (1) the 
co-OPT statistic, (2) the Cramer test statistic [2], (3) the log Bayes factor under Polya tree 
(PT) priors in [TU], and (4) the posterior mean of the "similarity parameter" e given in the 
dependent Dirichlet Process mixture (DPM) prior proposed in [14J. The ROC curves are 



presented in Figure 2 Again, the co-OPT performs relatively poorly for a simple (global) 
locational shift, but performs resonably well under the other three scenarios. The PT method 
does not allow adpative partitioning of the space, and that appears to have cost a lot of power. 
On the other hand, the DPM method performs well under all but the subset shift scenario. 
Because the similarity parameter e captures the proportion of "commonness" between two 
distributions [H] , it does not capture well differences pertaining to only a small portion of the 
probability mass. Details about the prior specifications for the PT and DPM methods can 
be found in the supplementary material. Note that there may be alternative specifications 
that will lead to better performance of these methods for the current example. 

Our next example deals with retrospectively sampled data on a high- dimensional contin- 
gency table. In this example, we not only demonstrate the power of our method to test for 
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Figure 2: ROC curves for two samples on M 2 under the four scenarios given in Example HI 

two sample difference, but also show that the posterior co-OPT distribution can help learn 
the underlying structure of the difference. 

Example 5 (Retrospectively sampled data on a 2 15 contingency table). Suppose there are 
15 binary predictors X%, X%, ■ ■ ■ , X^, and there is a binary response variable Y, e.g. disease 
status, whose distribution is 



Y 



Bernoulli(0.3) if X 3 = 1 and X 7 = 1 
Bernoulli^.?,) if X 7 = and X 10 = 
Bernoulli(0.1) otherwise. 
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We simulate populations for joint observations of Xj's and Y of size 200,000 under two 
scenarios 

1. Xi,X 2 ,..., X 1S Bernoulli(0.5) 

2 . X\ , X2 , . . . Xg as a 

Markov Chain with X 1 ~ Bernoulli (0.5), and P(X t = X t -i\X t -i) = 
0.7, while Xg, X 10 , . . . X 15 ~ iA d Bernoulli(0.5) and are independent of Xi, . . . , X 8 . 

For each scenario, we retrospectively sample controls (Y=0) and cases (Y—l). Our interest 
is in (1) the power of our method in detecting the difference in the joint distribution of the 
predictor variables between the two samples, and (2) whether the method can recover the 
"interactive" structure among the three predictors X3, X-j and X\§. 

We place two different priors on {Q\,Q%) and compare their performance. The first 
is our co-OPT distribution with prior parameters being specified as in Example [TJ The 
second is a dependent Dirithlet prior inspired by [TJ]. Under this setup we write Qi and 
Q 2 as mixtures, Q\ = eH + (1 — e)H\ and Q 2 = eH + (1 — e)H 2 , where H represents the 
common part of Qi and Q 2 while Hi and H 2 the idiosyncratic portion. Under the prior, H , 
Hi and H 2 ~i.i.d Dirichlet(ctH) and e ~ Beta(a e ,b e ). For the hyperparameters, we chose 
olh = (0.5, 0.5, . . . , 0.5) — that is, each cell in the support of the Dirichlet receives 0.5 prior 
pseuodocount, and a e = b e = 3. We found that due to the sparsity of the table, restricting 
the prior to have a support over only the observed table cells rather than the entire table 
drastically improves the power. Therefore, this is what we do here. (More details about the 
prior specification and how MCMC sampling is used to draw posterior samples for this prior 
can be found in the supplementary materials.) 

Because e can be thought of as a measure of how similar Qi and Q 2 are, the mean of its 
posterior distribution can serve as a statistic (which we shall from now on refer to as the 
e-statistic) for testing the difference between the two. The ROC curves of the e-statistic, and 
that of our co-OPT statistic, for the two scenarios and different sample sizes are given in the 
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left and middle columns of Figure 3 For comparison, the right column of the figure gives 
the ROC curve for another statistic measuring two sample difference, namely the empirical 
L 2 distance between the two contingency tables corresponding to the cases and the controls. 
Note that in this example to achieve comparable performance the the e-statistic and the 
L 2 distance both require samples sizes 10 times as large as those for the co-OPT! This 
performance advantage of the co-OPT in this setting is probably due to (1) the adaptive 
partitioning feature and (2) the coupling feature, both of which help mitigate the difficulties 
caused by the sparsity of the table counts. Also interesting is the impact of the correlation 
among predictors on the power. For the co-OPT, the correlation structure in Scenario 2 
makes it harder to find a good partition of the space and therefore reduces power. On the 
other hand, the performance of the e-statistic, as well as that of the L 2 distance, is actually 
better for Scenario 2, as the correlation structure turns a marginal association (marginal 
w.r.t. the subspace of X%, Xj and X\q) into a joint one involving X\ through X\q. 

While the e-statistic and the L 2 distance can only serve for detecting the difference, 
the posterior co-OPT can also capture the underlying structure of the difference. We find 
that with about 500 data points in each sample for Scenario 1 and about 3500 data points 
in each sample for Scenario 2, the underlying structure can be accurately recovered using 
the hierarchical maximum a posteriori (hMAP) tree topology, which is a top-down stepwise 
posterior maximum likelihood tree. (The construction of the hMAP tree as well as the 
motivation to choose it over the MAP tree is discussed in detail in Section 4.2 of [17].) As 
one would expect, the correlation between the predictor variables makes it much harder 
to recover the exact interactive relation. A typical hMAP tree structure for the simulated 



populations with these sample sizes is given in Figure 4 We note that in general a sample of 
partition trees from the posterior distribution of the tree structure can be more informative 
than the hMAP tree, especially when the sample sizes are not large enough. We use the 
hMAP here as a demonstration for its ease of visualization. 
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ROC for co-OPT 



ROC for dependent Dirichlet 



ROC for empirical L2 




Figure 3: ROC curves of the co-OPT (left), the e (middle), and the empirical L 2 (right) 
statistics for the two scenarios given in Example [5] (first row for Scenario 1 and second row 
for Scenario 2). The sample sizes for e and L 2 are 10 times as large as those for the co-OPT. 




Figure 4: A typical hMAP coupling tree that recovers the underlying interactive structure. 
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6 Inference on distributional distances between two samples 

In some situations, one may be interested in a distance measure for the two sample distri- 
butions. For example, if we let d(Qi,Q 2 ) denote the distance between the two sample dis- 
tributions under some metric d, one may want to compute quantities such as P(d(Qi, Q 2 ) > 
T\xi, x 2 ) where T is some constant. This can be achieved if one knows the posterior distribu- 
tion of d(Qi, Q 2 ) or can sample from it. We next show that if (Qi, Q 2 ) arises from a co-OPT 
distribution, then for some common metrics, in particular L\ and Hellinger distances, it is 
very convenient to sample from the distribution of d(Qi,Q 2 ). 

As before, let Q\ and Q 2 (with densities q\ and q 2 respectively) be the two distributions 
of interest. Suppose (Q\, Q 2 ) have a co-OPT distribution, and so they can be thought of as 
being generated from the random-partitioning-and-assignment procedure introduced in the 
previous section through the drawing of the variables C, J, Ox, 2 , C b , J b and 6 b . Then we 
have the following result. 

Proposition 5. Suppose (Qi,Q2) has a co-OPT distribution satisfying the conditions given 
in Theorem^ Let A(C,J) denote the (random) collection of all nodes on which Qi and 
Q 2 first couple. (The notation indicates that it depends on the coupling variables C and 
J.) Also, let di 1 be the L\ distance, and d^ the squared Hellinger distance. (That is, 
d Ll {f,g) = f\f-g\ andd H 2(f,g) = JiVJ-y/g) 2 .) Then 

d Ll {Qi,Q2)= E \QM)-Q 2 {A)\ 

A£A(C,J) 
A€A{C,J) 



Proof. See supplementary materials. □ 
This proposition provides a recipe for drawing samples from the distributions of (Qi, Q 2 ) 

29 



and d,H 2 (Qi,Q2)- One can first draw the coupling variables C, J, 1 and 6 2 . Then use C 
and J to find the collection of nodes A(C, J), and use 1 and 2 to compute, for each 
A G A(C, J), the corresponding measures Qi(A) and Q 2 (A). Finally, one draw of d^ x (or 
d H 2) can be computed by summing |Qi(A) — Q 2 (A)\ (or (^Qi(A) — ^Q 2 (A)) 2 ) over all 
nodes in A(C, J). 

A particularly desirable feature of this procedure for sampling L\ and Hellinger distances 
is that one does not need to draw samples for the two random distributions Qi and Q 2 to get 
their distances. In fact, one only needs to draw the coupling variables, which characterize the 
difference between the two distributions, without having to draw the base variables, which 
characterize the fine structure of the two densities. Again, in mult i- dimensional settings 
where estimating densities is difficult, such a procedure can produce much less variable 
samples for the distances. 

We close this section with two more numerical examples, one in M. and one in M 2 . In the 
second of these, again we use the observed range of the data in each dimension to define the 
space fl Also, we use 1/10000 as the size cutoff for technical termination. 

Example 6 (Two beta distributions). We simulate two samples from Beta(2,5) and Beta(20,15) 
under three sets of sample sizes n\ = n 2 =10, 100 and 1000. We place a co-OPT prior on the 
two distributions with the diadic partition rule and the symmetric parameter values as spec- 
ified in Example [2] with Po = 7o = 0.5, and compute the corresponding posterior co-OPT. 
Then we draw 1000 samples for each of dL 1 (Qi,Q 2 ) and dH?{Qi,Q2) from their posterior 



distributions. The histograms of these samples are plotted in Figure 5, where the vertical 



lines indicate the actual Li and squared Hellinger distances between the two distributions. 

Example 7 (Bivariate normal and mixture of bivariate normal). We repeat the same thing 
as in the previous example except now we simulate the two samples from the following 
distributions in IR 2 . 
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L1 n1=n2: 



L1 n1=i 



L1 n1=n2: 



!=1000 



0.0 0.5 1.0 1.5 2.0 



1.3 1.4 1.5 1.6 1.7 1.1 



1.45 1.50 1.55 1.60 



Hellinger n1=n2=10 



Hellinger n1=n2=100 



tlL 



Hellinger n1=n2=1000 



0.0 0.5 1.0 1.5 



0.7 0.8 0.9 1.0 1.1 1.2 1.3 1.4 
Hellinger 



0.85 0.90 0.95 1.00 1.05 1.10 
Hellinger 



Figure 5: Histograms for posterior samples of L\ and squared Hellinger distances for two 
samples from Beta(2,5) and Beta(20,15). The vertical lines indicate the actual L\ and 
squared Hellinger distance between these two distributions. 



Sample 1 ~ BN I 



\ / 



, and 



Sample 2 ~ 0.5 x BN 



+ 0.5 x 5iV 



/-A 

v-v 



v° V 



Again we draw 1000 posterior samples for di 1 (Qi,Q2) and for du^iQi^Qi) under each 
set of sample sizes. The histograms of these samples are plotted in |Figure 6" , where the 
vertical lines again indicate the actual L\ and squared Hellinger distances between the two 
distributions. 
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L1, n1=n2=10 



L1, n1=n2=100 



L1, n1=n2=1000 



0.0 0.5 1.0 1.5 



0.4 0.6 0.8 1.0 1.2 



0.60 0.65 0.70 0.75 0.80 0.85 



Hellinger, n1=n2=10 



Hellinger, n1=n2=100 



LL 



0.0 0.2 0.4 0.6 0.8 1.0 1.2 
Hellinger 



0.1 0.2 0.3 0.4 0.5 
Hellinger 



Hellinger, n1=n2=1000 



0.20 0.25 0.30 

Hellinger 



Figure 6: Histograms for posterior samples of L\ and squared Hellinger distances for Exam- 
ple [0 The vertical lines indicate the actual L\ and squared Hellinger distances for the two 
underlying distributions. 



7 Concluding remarks 

In this work we have introduced the coupling optional Polya tree prior for Bayesian non- 
parametric analysis on the two sample problem. This prior jointly generates two random 
probability distributions that can "couple" on subsets of the sample space. We have demon- 
strated that this construction allows both the testing and the learning of the distributional 
difference between the two samples. One can easily extend this prior to allow the joint gen- 
eration of more than two samples. For example, if four samples are involved, then one can 
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draw four, instead of two, independent Dirichlet vectors for probability assignment on each 
uncoupled node. 

One interesting feature of the co-OPT prior (as well as the original OPT prior) is that the 
corresponding posterior can be computed "exactly" using the recursive formulation given in 
(14. 2 p without resorting to Markov Chain Monte Carlo sampling. However, such "exact infer- 
ence" based on recursions is still computationally intensive, especially in high-dimensional 
problems. Efficient implementation is a necessity for this method to be feasible for any 
non-trivial problems. However, even with the most efficient implementation, the exponen- 
tial nature of the method dictates that approximation techniques such as fc-step look-ahead 
as well as large-scale parallelization are needed for very high dimensional problems, such 
as those on a contingency table with 100 dimensions. Current work is undergoing in this 
direction. 
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Supplemental materials 

Appendices A.l includes all the proofs. A. 2 gives the details about the prior specifications 
for the comparison made in Example HI A. 3 gives the details about the specification of 
the dependent Dirichlet prior as well as the Gibbs sampler used for drawing posterior 
samples of e. 
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Appendix A.l. Proofs 



Proof of Theorem 1. Consider the RPAA procedure described in Section 2 with the uniform 
base distribution u replaced by go- So under this new procedure of generating a random 
measure Q, whenever a region A gets stopped, the conditional distribution of Q within A is 
set to be Qo(-\A). Let be the corresponding random distribution that is forced to stop 
after k levels of nested partitioning. In other words, for all non-stopped nodes A reached 
after k levels of nested partitioning, we stop dividing A regardless of the stopping variable 
S(A) and force a conditional distribution Q (-\A) on it to obtain Q^ k \ (For more detail see 
the proof of Theorem 1 in [TT].) 



We first show that if aj{A)/ Y%L[ A) 4X A ) = Qo(M)/Q (A), then EQ^{B) = Q {B) for 



levels of random partitioning — those are the nodes that are either just reached in the kth step 



For k = 0, = 0, A(jW) = {tt}, and Q(°) = Q and so E (QW(B)\jW) = Q (B) holds 



trivially. Now for k > 1, suppose this holds true for 1, 2, . . . , k — 1. By construction, 



all k. For k > 0, let be the collection of all partition random variables S and J drawn 
in the first k levels of partitioning, and let A(J^) be the collection of all leaf nodes after k 



or are reached earlier but stopped. We prove by induction that E [Q^ k \B)\J^ k ^ = Q (B). 




AeA(j( k )) 




Qo(BnA) 
Qo(A) 



Let A p E A(J^ k 1 * > ) be the parent node of A, that is, the node whose division gives rise to 
A. Then by the condition that a\ \A)j Y!h=\ ] °i( A ) = Qo(M)/Qo( A )> we have 



E{QW(A)l<p\A*)\jW) = Q (A)/Q (An, 
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and so 



AeA(jW) V ; \ 



J 



(k) 



J 



v Qo(BnA) f QW(A) (k) 



V- Qo(flnA) Qq(A) J n{k)(AP , 



J 



= Yl Qo(BnA)=Q (B). 

AeA(jW) 

This shows that E (Q {k) (B)\J^) = Q (B) and thus EQ^(B) = Q (B) for all k. But since 
\Q^ k \B) — Q(B)\ — > a.s. (see the proof of Theorem 1 in [IT]), by bounded convergence 
theorem, we have E\Q^(B) - Q(B)\ -> 0, and so = Q (B). □ 

Proof of Theorem 2. We first claim that with probability 1, and Q 2 respectively con- 
verge in total variational distance to two absolutely continuous random probability measures 
Qi and Q 2 , and thus for any Borel set E, 

\qM(e)-q 1 (e)\ + \q?\e)-q 2 (e)\ 

<sup EieB \Q?\E 1 )-Q 1 (E 1 )\ + sup E2eB \Q ( f\E 2 )-Q 1 (E 2 )\^0, w.p.l. 

To prove the claim, we note that the marginal procedure that generates Qi, for instance, is 
simply an OPT with random local base measures that arise from standard OPT distributions. 
To see this, we can think of the generative procedure of Q\ as consisting of the following two 



35 



steps. 

1. For each potential tree node A under 1Z, we draw an independent random measure Qq 
from OPT\ A {TL, Pl \\ct h ). 

2. Generate Qi from the standard random-partitioning-and-random-assignment proce- 
dure for an OPT, treating {C(A)} as the stopping variables, {J (A)} as the partition 
selector variables, and {d^ (A)} as the probability assignment variables, and with 
{Qq } being the local base measures. That is, when a node A is stopped, the condi- 
tional distribution Qx(-\A) is set to be Qq{-)- 

By Theorem 1 in [T7], for each potential node A, with probability 1, Qq is an absolutely 
continuous distribution. Because the collection of all potential tree nodes A under TZ is 
countable, with probability 1, this simultaneously holds for all Qq. Therefore, with proba- 
bility 1, the marginal procedure for producing Qi is just that for an OPT with local base 
measures {Qq}- The same argument for proving Theorem 1 in [IT] (with n(-\A) replaced 
by Qq (•)) shows that with probability 1, an absolutely continuous measure Qi exists as the 

(k) 

limit of Q\ in total variational distance. The same argument proves the claim for Q2 as 
well. □ 

Proof of Theorem 3. Because any density function on Q can be arbitrarily approximated in 
L\ by uniformly continuous ones, without loss of generality, we can assume that f\ and f2 
are uniformly continuous. Let 

5i(e) = sup - fi{y)\ and 6 2 (e) = sup \f 2 (x) - f 2 (y)\- 

\x-y\<e \x-y\<e 

By uniform continuity, we have 5j(e) j. as e J, for i = 1,2. Also, by Condition (1), for any 
e > 0, there exists a partition of Q = Uj =1 Ai such that the diameter of each Aj is less than 
e. By Condition (2), there is positive probability that this partition will arise after a finite 
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number of steps of recursive partitioning. Also because the parameters of the co-OPT are 
all bounded away from and 1, there is a positive probability that the A^s are exactly the 
sets on which Q\ and Q 2 first couple. Now let q Ai be the local base measure on each of A^ 
we can write 



i=i i=i 



?i(z) = ^QM^^UXx) and q 2 (x) = Q2{A i )q Ai (x)l Ai {x). 
Accordingly, 



= Y,f \Qi(A)q A *(x) - fi(x)\dfM(x) 
i=i Ja * 

<Y,QMi) I \q A '(x)-l/fi(A t )\dfi(x) + J2 f |QiWMA)-/i(x)|dM^) 

i=l "^i i=l 

i=l i=l i=l 

where f[ := f A f\{x)djj,{x) / fi(Ai) . By the exact same calculation we have 



1 

< " 



\Q2(x) - f 2 (x)\dfi(x) 

i=l i=l i=l 



where /| := f A f 2 (x)dfi(x) / fi(Ai) . By the choice of A i: we have that f A \f{ — fi(x)\d/i(x) < 
5 1 (e)n(A i ) and ^ |/« - f 2 (x)\d»(x) < <5 2 (e)//(A)- Thus, 

E/ \fi-fi(*)\Mx)<8i(e)iJL(n) and W I/2 ~ f2(x)\dfi(x) < 8 2 (e)n(n). 
i=i ^ »=i ^ 
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So by choosing e small enough, we can have 



max{5 1 (e),5 2 (e)}/i(fi) < r/3. 

Next, because all the coupling parameters of the co-OPT prior are uniformly bounded away 
from and 1, (conditional on the coupling partition) with positive probability, we have 



\Q 1 (Ai)-fi»(A i )\<-!--r and \Q 2 {A i )-fi^A l )\< 7 



for all z = 1,2, ... , /. Similarly, because all the base parameters are also uniformly bounded 
away from and 1, by Theorem 2 in [17], (conditional on the coupling partition and proba- 
bility assignments,) with positive probability we have 



/ \q Ai (x)-l/ii{Ai)\dn{x) < 



T 



3 • 2 i 



for all i = 1,2, ... ,1. Placing the three pieces together, we have positive probability for 
/ \Qi{ x ) ~ fi( x )\dfi < t and J |g 2 (^) — f2{x)\d[i < r to hold simultaneously. □ 

Proof of Proposition 5. We prove the result only for as the proof for d^i is very similar. 
(All following equalities and statements hold with probability 1.) 



d Ll {Qi,Q2) = / \qi(x) - q 2 {x)\n{dx) 



Y] / \qi(x) - q 2 (x)\fi(dx) + / 
AeA(c,J) jA Jn\uA(c,J) 



\Qi{x) ~ q2{x)\n(dx). 
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But for each A e A(C, J), due to coupling we have qi(-\A) = q 2 (-\A), and so 



L 



qi{x) - q 2 {x)\ii(dx) 



J A 



[ \Q 1 (A)-Q 2 (A)\q 1 (x\A) f i(dx) 



\Qx(A) - Q 2 (A)\. 



On the other hand, Q 1 (Q \ UA(C, J)) = Q 2 (Q \ UA(C, J)) = fi(Q \ UA(C, J)) = w.p.l. 
(See proof of Theorem 1 in [17].) Therefore, 



Appendix A. 2. Prior specifications for Example 4 

For the Polya tree two sample test [10] . we have imposed that each tree node is partitioned 
in the middle of both dimensions at each level. Therefore for our example in M 2 , each node 
has four children. We also impose that the prior pseudo-counts a are 0.5 for all children. 
The software used in this paper for this method is written by us. 

On the other hand, we used R package DPpackage function HDPMdensity to fit the Diri ch- 
ief Process mixture (DPM) model proposed in [T3j. More specifically, the two distributions 
are modeled as. 



where Hq models the common part of Fx and F 2 , whereas Hx and H 2 the unique parts. The 
parameter e captures the proportion of "commonnes" between the two distributions, and thus 



\Qi(A) - Q 2 (A)\. 



AeA(C,J) 



□ 



Fx = eif + (1 - e)H x 



F 2 = eH + (1 - e)H 2 , 
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can serve as a measure of how the two differ. Each of the Hi for % = 0, 1, 2 is modeled as a 
Dirichlet Process mixture of normals. 



The baseline distribution G is assumed to be Normal(fj, , S ). Following the example given 
by DPpackage, the (empirical) hyperprior specifications are 



e ~ 0.15 + 0.15i + 0.8 Unif[0, 1], 
at ~ Unif (0,1) for i = 0,1,2. 
So|/io, ~ InverseWishart(iiQ = 9,T = Var(?/)) 
/i |m , So ~ A^(m = mean(y), S = Var(y)) 

S|z/,T ~ InverseWishart(v = 9,T = 0.25Var(y)), 



where y is the combination of the two samples, mean(-) is the dimension-wise average, and 
Var is the covariance. The statistic we use to measure two sample difference (or similarity) 
is the posterior mean of e, estimated by the mean of the MCMC sample of size 10,000, with 
10,000 burn-in steps. 




where 



Gi\ai,G ~ DP(aii,Go). 
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Appendix A. 3. The dependent Dirichlet prior in Example 5 

Motivated by the hierarchical Dirichlet process mixture prior setup introduced in [14|, we 
can design the following prior for (Qi, Q2) on the finite support of a contingency table. 

Qi = eH + (1 - e)^ 
Q 2 = eH + (1 - e)H 2 

with 

H , Hi, H 2 ~i.i.d Dirichlet(ct H ) 
e ~ Beta(a e , b t ). 

We used ot H = (0.5, 0.5, . . . , 0.5) and a e = b t = 3 as the prior parameters. We found 
that restricting the support of an to the observed table cells rather than the entire table 
significantly improves the power of the method. This is due to the sparsity of the table 
counts — the vast majority of the table cells are empty. 

To draw posterior samples of e, we use the following Gibbs sampler. First some nota- 
tions. Let X 1 = {X{, X\, . . . , X^} and X 2 = {Xf , X\, . . . , X% 2 } denote the two sample 
observations. For each observation X 1 - in sample i = 1 or 2, we introduce a Bernoulli variable 
Jj that serves as an indicator for whether X\ has come for Hi or Hq. Given e, the Jj's are 
i.i.d. Bernoulli(e) variables. For simplicity, we denote (J\, J % 2 , . . . , J^.) as J 1 . Given J 1 we 
let 

X*'° = {X} : Jj = 0,j = 1, 2, . . . , m} and X 1 ' 1 = {Xj : Jj = l,j = 1, 2, . . . , nj 
for % = 1,2. In addition, we let n(X l ) be the table counts of of sample i in the support 
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of a H , and similarly define n(X !, °) and n(X hl ). With these notations, now we next write 
down the conditional distributions of H , H 1 , H 2 , e, J 1 and J 2 . 



We use this Gibbs sampler to draw posterior samples for e. We compute the posterior mean 
of e from 10, 000 samples with 10, 000 burn-in iterations. 
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